Cooperative Distributed GPU Power Capping for Deep Learning Clusters

Authors

Abstract

Recent GPU-based clusters that handle deep learning (DL) tasks are characterized by GPU device heterogeneity, a variety of deep neural network (DNN) models, and high computational complexity. Thus, traditional power capping methods designed for CPU-based or small-scale devices cannot be applied to handling DL tasks. This article develops a cooperative distributed GPU power capping (CD-GPC) system for GPU clusters, aiming to minimize the training completion time of invoked DL tasks without exceeding the limited power budget. Specifically, we first design a frequency scaling (FS) approach using online model estimation based on the recursive least square (RLS) method, which achieves accurate tuning of DL task training time and power usage without needing offline profiling. Then, we formulate the proposed FS problem as a Lagrangian dual decomposition-based economic model predictive control problem for large-scale heterogeneous GPU clusters. We conduct both NVIDIA GPU-based lab-scale real experiments and real job trace-based simulations for performance evaluation. Experimental results validate that the proposed system improves power capping accuracy to a mean absolute error $<\!1\%$ and reduces the deadline violation ratio by 21.5% compared with other counterparts.
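The two building blocks named in the abstract can be illustrated with small sketches. First, a minimal recursive least squares (RLS) estimator with a forgetting factor: this is generic textbook RLS, not the paper's exact formulation, and the frequency-to-power feature map and sample readings below are invented for illustration.

```python
import numpy as np

class RLSEstimator:
    """Recursive least squares with a forgetting factor (illustrative sketch,
    not the paper's exact model)."""
    def __init__(self, n_features, forgetting=0.98, delta=100.0):
        self.theta = np.zeros(n_features)      # model parameters
        self.P = delta * np.eye(n_features)    # inverse correlation matrix
        self.lam = forgetting

    def update(self, phi, y):
        """phi: feature vector built from the GPU frequency;
        y: observed iteration time or power sample."""
        phi = np.asarray(phi, dtype=float)
        Pphi = self.P @ phi
        gain = Pphi / (self.lam + phi @ Pphi)  # RLS gain vector
        err = y - self.theta @ phi             # prediction error
        self.theta += gain * err
        self.P = (self.P - np.outer(gain, Pphi)) / self.lam
        return err

# Hypothetical example: assume power is roughly affine in (f, f^3);
# the sampled numbers are illustrative, not measured values.
def power_features(freq_ghz):
    return np.array([1.0, freq_ghz, freq_ghz ** 3])

est = RLSEstimator(n_features=3)
for freq, watts in [(1.0, 180.0), (1.2, 230.0), (1.4, 300.0)]:
    est.update(power_features(freq), watts)
print(est.theta)  # current estimate of the frequency->power model
```

Second, the Lagrangian dual decomposition idea: a coordinator prices cluster power and each GPU node picks its own cap against that price, so nodes do not need each other's models. The sketch below shows only the price (dual) update, not the full receding-horizon economic model predictive control; the node utilities and all numbers are assumptions.

```python
# Hypothetical dual-decomposition sketch: the coordinator adjusts a scalar
# "power price" by projected subgradient ascent; each node independently
# picks the cap that maximizes its own (utility - price * power).
def best_cap(price, u_max, p_min, p_max):
    # Toy local problem: maximize u_max*log(p) - price*p, solved in closed
    # form and clipped to the node's feasible cap range [p_min, p_max].
    p_star = u_max / price if price > 0 else p_max
    return min(max(p_star, p_min), p_max)

def coordinate(budget_w, nodes, step=1e-4, iters=5000):
    price = 1.0
    for _ in range(iters):
        caps = [best_cap(price, *n) for n in nodes]
        # Raise the price if the chosen caps exceed the budget, lower it otherwise.
        price = max(1e-9, price + step * (sum(caps) - budget_w))
    return price, caps

nodes = [(200.0, 100.0, 300.0), (150.0, 100.0, 250.0), (120.0, 80.0, 250.0)]
price, caps = coordinate(budget_w=600.0, nodes=nodes)
print(round(price, 4), [round(c, 1) for c in caps])
```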


Similar Resources

Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

Deep learning models can take weeks to train on a single GPU-equipped machine, necessitating scaling out DL training to a GPU-cluster. However, current distributed DL implementations can scale poorly due to substantial parameter synchronization over the network, because the high throughput of GPUs allows more data batches to be processed per unit time than CPUs, leading to more frequent network...


Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster

Deep learning algorithms base their success on building high learning capacity models with millions of parameters that are tuned in a data-driven fashion. These models are trained by processing millions of examples, so that the development of more accurate algorithms is usually limited by the throughput of the computing devices on which they are trained. In this work, we explore how the trainin...


Distributed Learning for Cooperative Inference

Abstract We study the problem of cooperative inference where a group of agents interact over a network and seek to estimate a joint parameter that best explains a set of observations. Agents do not know the network topology or the observations of other agents. We explore a variational interpretation of the Bayesian posterior density, and its relation to the stochastic mirror descent algorithm, ...



Journal

Journal title: IEEE Transactions on Industrial Electronics

Year: 2022

ISSN: 1557-9948, 0278-0046

DOI: https://doi.org/10.1109/tie.2021.3095790